Loan Approval Modeling and Predictive Analytics
1 Introduction
In today’s rapidly evolving financial sector, the efficient management of loan repayment is crucial for both lending institutions and borrowers. This project represents a strategic initiative to apply the latest in data science and machine learning to enhance the loan approval process. Our data-driven approach utilizes an extensive dataset, encompassing the profiles of over 300,000 loan applicants across 122 variables, to predict loan approval outcomes with a focus on repayment history. The aim is to create a predictive model that not only interprets applicant data but also offers actionable insights for lenders.
The primary objective of our research is to thoroughly analyze applicant characteristics and financial indicators within our dataset. This involves identifying patterns in the repayment history of borrowers, especially those in default, to develop predictive models. These models are designed to be more than statistical tools; they serve as strategic assets that can potentially streamline the loan approval process, enhancing efficiency and accuracy.
Our ambition goes beyond traditional data analysis. We are seeking to transform the way lending institutions make decisions, providing a deeper and more comprehensive risk assessment tool. The goal is to facilitate a shift in loan approval processes, reducing associated risks and improving the overall lending experience for all parties involved. This project is not just about handling data; it’s about innovating in the realm of financial decision-making.
The dataset chosen for this project includes critical variables that shed light on various aspects of an applicant’s profile, such as contract type, gender, car ownership, real estate ownership, and more. Analyzing these variables is crucial for our predictive modeling, offering insights into the diverse backgrounds and circumstances of loan applicants. This analysis is the cornerstone of our approach, aiming to enhance the accuracy and relevance of our loan approval predictions.
Reading the Dataset
train <- fread("application_train.csv", na.strings=c("NA","NaN","?", "","XNA"))
test <- fread("application_test.csv", na.strings=c("NA","NaN","?", "","XNA"),showProgress = FALSE)
dt1 <- fread('application_train.csv', showProgress = FALSE)
app <- read_csv("application_train.csv") %>% mutate(TARGET = as.factor(TARGET))
## Rows: 307,511
## Columns: 122
## $ SK_ID_CURR <int> 100002, 100003, 100004, 100006, 100007, 1…
## $ TARGET <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ NAME_CONTRACT_TYPE <chr> "Cash loans", "Cash loans", "Revolving lo…
## $ CODE_GENDER <chr> "M", "F", "M", "F", "M", "M", "F", "M", "…
## $ FLAG_OWN_CAR <chr> "N", "N", "Y", "N", "N", "N", "Y", "Y", "…
## $ FLAG_OWN_REALTY <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", "Y", "…
## $ CNT_CHILDREN <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…
## $ AMT_INCOME_TOTAL <dbl> 202500.00, 270000.00, 67500.00, 135000.00…
## $ AMT_CREDIT <dbl> 406597.5, 1293502.5, 135000.0, 312682.5, …
## $ AMT_ANNUITY <dbl> 24700.5, 35698.5, 6750.0, 29686.5, 21865.…
## $ AMT_GOODS_PRICE <dbl> 351000, 1129500, 135000, 297000, 513000, …
## $ NAME_TYPE_SUITE <chr> "Unaccompanied", "Family", "Unaccompanied…
## $ NAME_INCOME_TYPE <chr> "Working", "State servant", "Working", "W…
## $ NAME_EDUCATION_TYPE <chr> "Secondary / secondary special", "Higher …
## $ NAME_FAMILY_STATUS <chr> "Single / not married", "Married", "Singl…
## $ NAME_HOUSING_TYPE <chr> "House / apartment", "House / apartment",…
## $ REGION_POPULATION_RELATIVE <dbl> 0.018801, 0.003541, 0.010032, 0.008019, 0…
## $ DAYS_BIRTH <int> -9461, -16765, -19046, -19005, -19932, -1…
## $ DAYS_EMPLOYED <int> -637, -1188, -225, -3039, -3038, -1588, -…
## $ DAYS_REGISTRATION <dbl> -3648, -1186, -4260, -9833, -4311, -4970,…
## $ DAYS_ID_PUBLISH <int> -2120, -291, -2531, -2437, -3458, -477, -…
## $ OWN_CAR_AGE <dbl> NA, NA, 26, NA, NA, NA, 17, 8, NA, NA, NA…
## $ FLAG_MOBIL <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_EMP_PHONE <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,…
## $ FLAG_WORK_PHONE <int> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,…
## $ FLAG_CONT_MOBILE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FLAG_PHONE <int> 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,…
## $ FLAG_EMAIL <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ OCCUPATION_TYPE <chr> "Laborers", "Core staff", "Laborers", "La…
## $ CNT_FAM_MEMBERS <dbl> 1, 2, 1, 2, 1, 2, 3, 2, 2, 1, 3, 2, 2, 3,…
## $ REGION_RATING_CLIENT <int> 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,…
## $ REGION_RATING_CLIENT_W_CITY <int> 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,…
## $ WEEKDAY_APPR_PROCESS_START <chr> "WEDNESDAY", "MONDAY", "MONDAY", "WEDNESD…
## $ HOUR_APPR_PROCESS_START <int> 10, 11, 9, 17, 11, 16, 16, 16, 14, 8, 15,…
## $ REG_REGION_NOT_LIVE_REGION <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ REG_REGION_NOT_WORK_REGION <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ LIVE_REGION_NOT_WORK_REGION <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ REG_CITY_NOT_LIVE_CITY <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ REG_CITY_NOT_WORK_CITY <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ LIVE_CITY_NOT_WORK_CITY <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ ORGANIZATION_TYPE <chr> "Business Entity Type 3", "School", "Gove…
## $ EXT_SOURCE_1 <dbl> 0.08303697, 0.31126731, NA, NA, NA, NA, 0…
## $ EXT_SOURCE_2 <dbl> 0.2629486, 0.6222458, 0.5559121, 0.650441…
## $ EXT_SOURCE_3 <dbl> 0.13937578, NA, 0.72956669, NA, NA, 0.621…
## $ APARTMENTS_AVG <dbl> 0.0247, 0.0959, NA, NA, NA, NA, NA, NA, N…
## $ BASEMENTAREA_AVG <dbl> 0.0369, 0.0529, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BEGINEXPLUATATION_AVG <dbl> 0.9722, 0.9851, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BUILD_AVG <dbl> 0.6192, 0.7960, NA, NA, NA, NA, NA, NA, N…
## $ COMMONAREA_AVG <dbl> 0.0143, 0.0605, NA, NA, NA, NA, NA, NA, N…
## $ ELEVATORS_AVG <dbl> 0.00, 0.08, NA, NA, NA, NA, NA, NA, NA, N…
## $ ENTRANCES_AVG <dbl> 0.0690, 0.0345, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMAX_AVG <dbl> 0.0833, 0.2917, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMIN_AVG <dbl> 0.1250, 0.3333, NA, NA, NA, NA, NA, NA, N…
## $ LANDAREA_AVG <dbl> 0.0369, 0.0130, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAPARTMENTS_AVG <dbl> 0.0202, 0.0773, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAREA_AVG <dbl> 0.0190, 0.0549, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAPARTMENTS_AVG <dbl> 0.0000, 0.0039, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAREA_AVG <dbl> 0.0000, 0.0098, NA, NA, NA, NA, NA, NA, N…
## $ APARTMENTS_MODE <dbl> 0.0252, 0.0924, NA, NA, NA, NA, NA, NA, N…
## $ BASEMENTAREA_MODE <dbl> 0.0383, 0.0538, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BEGINEXPLUATATION_MODE <dbl> 0.9722, 0.9851, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BUILD_MODE <dbl> 0.6341, 0.8040, NA, NA, NA, NA, NA, NA, N…
## $ COMMONAREA_MODE <dbl> 0.0144, 0.0497, NA, NA, NA, NA, NA, NA, N…
## $ ELEVATORS_MODE <dbl> 0.0000, 0.0806, NA, NA, NA, NA, NA, NA, N…
## $ ENTRANCES_MODE <dbl> 0.0690, 0.0345, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMAX_MODE <dbl> 0.0833, 0.2917, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMIN_MODE <dbl> 0.1250, 0.3333, NA, NA, NA, NA, NA, NA, N…
## $ LANDAREA_MODE <dbl> 0.0377, 0.0128, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAPARTMENTS_MODE <dbl> 0.0220, 0.0790, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAREA_MODE <dbl> 0.0198, 0.0554, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAPARTMENTS_MODE <dbl> 0.0000, 0.0000, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAREA_MODE <dbl> 0.0000, 0.0000, NA, NA, NA, NA, NA, NA, N…
## $ APARTMENTS_MEDI <dbl> 0.0250, 0.0968, NA, NA, NA, NA, NA, NA, N…
## $ BASEMENTAREA_MEDI <dbl> 0.0369, 0.0529, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BEGINEXPLUATATION_MEDI <dbl> 0.9722, 0.9851, NA, NA, NA, NA, NA, NA, N…
## $ YEARS_BUILD_MEDI <dbl> 0.6243, 0.7987, NA, NA, NA, NA, NA, NA, N…
## $ COMMONAREA_MEDI <dbl> 0.0144, 0.0608, NA, NA, NA, NA, NA, NA, N…
## $ ELEVATORS_MEDI <dbl> 0.00, 0.08, NA, NA, NA, NA, NA, NA, NA, N…
## $ ENTRANCES_MEDI <dbl> 0.0690, 0.0345, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMAX_MEDI <dbl> 0.0833, 0.2917, NA, NA, NA, NA, NA, NA, N…
## $ FLOORSMIN_MEDI <dbl> 0.1250, 0.3333, NA, NA, NA, NA, NA, NA, N…
## $ LANDAREA_MEDI <dbl> 0.0375, 0.0132, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAPARTMENTS_MEDI <dbl> 0.0205, 0.0787, NA, NA, NA, NA, NA, NA, N…
## $ LIVINGAREA_MEDI <dbl> 0.0193, 0.0558, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAPARTMENTS_MEDI <dbl> 0.0000, 0.0039, NA, NA, NA, NA, NA, NA, N…
## $ NONLIVINGAREA_MEDI <dbl> 0.0000, 0.0100, NA, NA, NA, NA, NA, NA, N…
## $ FONDKAPREMONT_MODE <chr> "reg oper account", "reg oper account", N…
## $ HOUSETYPE_MODE <chr> "block of flats", "block of flats", NA, N…
## $ TOTALAREA_MODE <dbl> 0.0149, 0.0714, NA, NA, NA, NA, NA, NA, N…
## $ WALLSMATERIAL_MODE <chr> "Stone, brick", "Block", NA, NA, NA, NA, …
## $ EMERGENCYSTATE_MODE <chr> "No", "No", NA, NA, NA, NA, NA, NA, NA, N…
## $ OBS_30_CNT_SOCIAL_CIRCLE <dbl> 2, 1, 0, 2, 0, 0, 1, 2, 1, 2, 0, 0, 0, 0,…
## $ DEF_30_CNT_SOCIAL_CIRCLE <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ OBS_60_CNT_SOCIAL_CIRCLE <dbl> 2, 1, 0, 2, 0, 0, 1, 2, 1, 2, 0, 0, 0, 0,…
## $ DEF_60_CNT_SOCIAL_CIRCLE <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ DAYS_LAST_PHONE_CHANGE <dbl> -1134, -828, -815, -617, -1106, -2536, -1…
## $ FLAG_DOCUMENT_2 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_3 <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…
## $ FLAG_DOCUMENT_4 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_5 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_6 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ FLAG_DOCUMENT_7 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_8 <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_9 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_10 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_11 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_12 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_13 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_14 <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_15 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_16 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_17 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_18 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_19 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_20 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FLAG_DOCUMENT_21 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AMT_REQ_CREDIT_BUREAU_HOUR <dbl> 0, 0, 0, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, …
## $ AMT_REQ_CREDIT_BUREAU_DAY <dbl> 0, 0, 0, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, …
## $ AMT_REQ_CREDIT_BUREAU_WEEK <dbl> 0, 0, 0, NA, 0, 0, 0, 0, 0, NA, 0, 0, 0, …
## $ AMT_REQ_CREDIT_BUREAU_MON <dbl> 0, 0, 0, NA, 0, 0, 1, 0, 0, NA, 1, 0, 1, …
## $ AMT_REQ_CREDIT_BUREAU_QRT <dbl> 0, 0, 0, NA, 0, 1, 1, 0, 0, NA, 0, 0, 0, …
## $ AMT_REQ_CREDIT_BUREAU_YEAR <dbl> 1, 0, 0, NA, 0, 1, 2, 0, 1, NA, 0, 2, 0, …
## application_train : ( 307511 , 122 )
a=colnames(train)
b=colSums(is.na(train)) %>% as.data.table
missing_value_table=cbind(a,b)
colnames(missing_value_table)=c("variables","Missing_values")
missing_value_table = missing_value_table %>% filter(Missing_values>0) %>%
mutate("% of Total Values" = round(100 * (Missing_values / nrow(train)),1)) %>%
arrange(desc(Missing_values))
cat("Your selected dataframe has", ncol(train), "columns.\n")
## Your selected dataframe has 122 columns.
## There are 69 columns that have missing values.
In our analysis of the application_train dataset, comprising 307,511 rows and 122 columns, we identified 69 columns with missing values. Discrete and continuous columns constitute 13.1% and 86.9% of the variables, respectively. No column is entirely missing, but only 2.8% of rows are complete, and roughly 24.5% of all data values are missing. These findings offer a holistic view of the dataset's composition and highlight the areas requiring attention before modeling.
## variables Missing_values % of Total Values
## 1: COMMONAREA_AVG 214865 69.9
## 2: COMMONAREA_MODE 214865 69.9
## 3: COMMONAREA_MEDI 214865 69.9
## 4: NONLIVINGAPARTMENTS_AVG 213514 69.4
## 5: NONLIVINGAPARTMENTS_MODE 213514 69.4
## 6: NONLIVINGAPARTMENTS_MEDI 213514 69.4
## 7: FONDKAPREMONT_MODE 210295 68.4
## 8: LIVINGAPARTMENTS_AVG 210199 68.4
## 9: LIVINGAPARTMENTS_MODE 210199 68.4
## 10: LIVINGAPARTMENTS_MEDI 210199 68.4
## 11: FLOORSMIN_AVG 208642 67.8
## 12: FLOORSMIN_MODE 208642 67.8
## 13: FLOORSMIN_MEDI 208642 67.8
## 14: YEARS_BUILD_AVG 204488 66.5
## 15: YEARS_BUILD_MODE 204488 66.5
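The complete-row share and per-column missing percentages quoted above can be reproduced with base R. This is a minimal sketch on a toy data frame standing in for `train`, not the report's exact code:

```r
# Toy data frame with the same kinds of gaps as application_train.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, 8, 9))

# Percentage of fully observed rows
complete_share <- 100 * mean(complete.cases(toy))

# Percentage of missing values per column, as in the table above
miss_pct <- round(100 * colSums(is.na(toy)) / nrow(toy), 1)
```

On the full dataset, replacing `toy` with `train` yields the figures reported in the missing-value table.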
Examining the Target Variable
ggplot(train,aes(TARGET))+
geom_bar(fill = "blue",alpha=0.3)+
geom_text(aes(label =scales::percent(..count../sum(..count..))),stat = 'count',vjust = -0.5)+
  scale_y_continuous(label = comma)
The bar plot of the TARGET variable indicates a highly imbalanced classification problem: 92% of loans were repaid on time, and only 8% ended in default. This underscores the need to address class imbalance for effective model training. To mitigate it, we assigned appropriate weights to the classes during training, ensuring the model is not skewed toward the majority class and can make accurate predictions for both repayment outcomes.
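The class weights mentioned above can be derived with a simple inverse-frequency rule. This is a hypothetical sketch on a stand-in vector, not necessarily the exact weighting scheme the model used:

```r
# `y` is a stand-in for train$TARGET (92% repaid, 8% default).
y <- c(rep(0, 92), rep(1, 8))

# Inverse-frequency weights: rarer class gets the larger weight.
tab <- table(y)
class_wts <- length(y) / (length(tab) * tab)        # one weight per class
obs_wts <- as.numeric(class_wts[as.character(y)])   # one weight per observation
```

With these proportions the minority (default) class receives a weight of 6.25 versus about 0.54 for the majority class, so each class contributes equally to the loss.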
train%>%
select_if(is.character)->carData
colNames<-names(carData)
carData$TARGET<-train$TARGET
carData$SK_ID_CURR<-train$SK_ID_CURR
for (name in colNames){
p<-distColumn(train,name,'TARGET','SK_ID_CURR')
plot(p)
}
The visual analysis of loan categories based on applicant characteristics reveals several key observations:
Cash Loans Dominance: The majority of loans taken were in the form of Cash Loans, and a significant portion of these loans were repaid on time.
Occupational Impact: Different occupational groups exhibit varying repayment behaviors. Working individuals, state servants, and commercial associates took out a considerable number of loans, with default rates of roughly 5% or less. Laborers secured the highest number of loans and likewise repaid largely on time.
Sector Disparities: Loans were less frequent among individuals in IT and HR sectors, suggesting a lower demand or eligibility within these professional domains.
The graphical representations provide a succinct overview, presenting both the percentage and count of loans taken, along with a breakdown of timely repayments and defaults. These insights can guide strategic decision-making for lending institutions, aiding in risk assessment and resource allocation.
train=train %>% mutate_if(is.character, list(~factor(.)))
train=train %>% mutate_if(is.integer, list(~as.numeric(.)))
train %>% select_if(is.factor) %>% summarise_all(n_distinct) %>% t()
## [,1]
## NAME_CONTRACT_TYPE 2
## CODE_GENDER 3
## FLAG_OWN_CAR 2
## FLAG_OWN_REALTY 2
## NAME_TYPE_SUITE 8
## NAME_INCOME_TYPE 8
## NAME_EDUCATION_TYPE 5
## NAME_FAMILY_STATUS 6
## NAME_HOUSING_TYPE 6
## OCCUPATION_TYPE 19
## WEEKDAY_APPR_PROCESS_START 7
## ORGANIZATION_TYPE 58
## FONDKAPREMONT_MODE 5
## HOUSETYPE_MODE 4
## WALLSMATERIAL_MODE 8
## EMERGENCYSTATE_MODE 3
## DAYS_BIRTH
## Min. :-25229
## 1st Qu.:-19682
## Median :-15750
## Mean :-16037
## 3rd Qu.:-12413
## Max. : -7489
## DAYS_BIRTH
## Min. :20.52
## 1st Qu.:34.01
## Median :43.15
## Mean :43.94
## 3rd Qu.:53.92
## Max. :69.12
## DAYS_EMPLOYED
## Min. :-17912
## 1st Qu.: -2760
## Median : -1213
## Mean : 63815
## 3rd Qu.: -289
## Max. :365243
ggplot(train,aes(DAYS_EMPLOYED))+
geom_histogram(fill = "blue",alpha=0.3)+
scale_x_continuous(label = comma)+
  scale_y_continuous(label = comma)
anom <- train %>% filter(DAYS_EMPLOYED == 365243)
non_anom <- train %>% filter(DAYS_EMPLOYED != 365243)
sprintf('There are %d anomalous days of employment', nrow(anom))
## [1] "There are 55374 anomalous days of employment"
In our analysis, we discovered a group of anomalous values in the "DAYS_EMPLOYED" column where every instance equals 365,243 (roughly 1,000 years of employment). Surprisingly, loans associated with these anomalies exhibited a lower default rate (5.40%) than non-anomalous cases (8.66%). Given this finding, handling the anomalies carefully becomes crucial. We opted for a cautious approach: converting the anomalous values to NA and introducing a new boolean column that records the original anomaly status. This preprocessing step brings the employment-days distribution closer to expectations. The NA values can then be imputed while the flag preserves the information that the value was originally anomalous, helping the model learn from the anomaly without being distorted by it.
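The flag-then-blank recipe described above is a few lines of base R. This sketch uses a tiny stand-in for `train`; the same two steps apply to the real column:

```r
# Miniature stand-in for train with the sentinel value present.
d <- data.frame(DAYS_EMPLOYED = c(-637, 365243, -1213, 365243))

# 1. Record which rows carried the anomalous sentinel.
d$DAYS_EMPLOYED_ANOM <- d$DAYS_EMPLOYED == 365243

# 2. Replace the sentinel with NA so it no longer distorts the distribution.
d$DAYS_EMPLOYED[d$DAYS_EMPLOYED_ANOM] <- NA
```

After this, `DAYS_EMPLOYED` can be imputed like any other missing value while `DAYS_EMPLOYED_ANOM` keeps the predictive signal the anomaly carried.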
ggplot(train,aes(DAYS_EMPLOYED))+
geom_histogram(fill = "blue",alpha=0.3)+
scale_x_continuous(label = comma)+
scale_y_continuous(label = comma)+
  ggtitle("Days Employment Histogram")
Effect of Age on Repayment
## TARGET
## DAYS_BIRTH -0.07823931
The "DAYS_BIRTH" variable exhibits the strongest positive correlation with the raw target. It records the client's age at the time of the loan in negative days, so the positive correlation means that, as clients get older (i.e., as the absolute value of "DAYS_BIRTH" grows), they are less likely to default. After transforming the variable to its absolute value, the correlation becomes approximately -0.0782, indicating that older clients tend to repay their loans on time more often. Visualizing the distribution of client ages in years provides a clearer picture of these age-related dynamics in repayment behavior.
ggplot(train, aes(DAYS_BIRTH /365))+
geom_histogram(fill = "blue",alpha=0.3,bins = 25)+
scale_x_continuous(label = comma)+
scale_y_continuous(label = comma)+
theme_light() +
labs(title = "Age of Client", x = "Age (years)")+
  theme(plot.title = element_text(hjust = .5))
On its own, the age distribution provides limited insight, mainly confirming the absence of outliers, as all recorded ages appear reasonable. To understand how age influences the target variable, we employed a kernel density estimate (KDE) plot with color differentiation based on the target values.
A kernel density estimate plot is a valuable visualization tool, illustrating the distribution of a single variable akin to a smoothed histogram. It is generated by placing a kernel at each data point and averaging these individual kernels into a single smooth curve.
In the forthcoming analysis, we will leverage the ggplot KDE plot to create an insightful graph that sheds light on the relationship between age and the target variable. The color-coded KDE plot will provide a nuanced depiction of how age dynamics correlate with the target, offering a more comprehensive view of the impact of age on loan repayment behavior.
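The place-a-kernel-and-average idea can be made concrete in a few lines of base R. This toy Gaussian KDE mirrors what `geom_density` computes, ignoring its automatic bandwidth selection (the ages below are hypothetical):

```r
# Toy Gaussian KDE: one normal kernel per observation, averaged over the sample.
gauss_kde <- function(x, grid, h) {
  sapply(grid, function(g) mean(dnorm((g - x) / h)) / h)
}

ages <- c(25, 28, 33, 41, 52, 60)                  # hypothetical client ages in years
dens <- gauss_kde(ages, grid = seq(20, 70, by = 1), h = 5)
```

Plotting `dens` against the grid reproduces the smooth curve that the KDE plots below display for each target class.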
ggplot(train, aes(DAYS_BIRTH /365,colour=as.factor(TARGET),group=as.factor(TARGET)))+
geom_density()+
scale_x_continuous(label = comma)+
scale_y_continuous()+
theme_light() +
labs(title = "Distribution of Ages", x = "Age (years)")+
  theme(plot.title = element_text(hjust = .5))
We generated a density plot of ages, color-coded by the target variable. The curve for defaults (TARGET == 1) skews toward the younger age range. Although the correlation coefficient is not notably high (-0.07), the variable remains valuable for machine learning models due to its discernible impact on the target.
In a complementary analysis, the age data is binned into 5-year intervals, allowing for a closer examination of the relationship between age and loan repayment. By calculating the average value of the target variable within each age bracket, the graph provides insights into the proportion of loans that were not repaid in different age categories. This approach offers a nuanced perspective on the influence of age on the likelihood of loan repayment failure.
age_data <- train %>%
select(TARGET, DAYS_BIRTH) %>%
mutate(YEARS_BIRTH = DAYS_BIRTH / 365)
# Bin the age data
age_data$YEARS_BINNED <- cut(age_data$YEARS_BIRTH, breaks = seq(20, 70, by = 5))
head(age_data, 10)
## TARGET DAYS_BIRTH YEARS_BIRTH YEARS_BINNED
## 1: 1 9461 25.92055 (25,30]
## 2: 0 16765 45.93151 (45,50]
## 3: 0 19046 52.18082 (50,55]
## 4: 0 19005 52.06849 (50,55]
## 5: 0 19932 54.60822 (50,55]
## 6: 0 16941 46.41370 (45,50]
## 7: 0 13778 37.74795 (35,40]
## 8: 0 18850 51.64384 (50,55]
## 9: 0 20099 55.06575 (55,60]
## 10: 0 14469 39.64110 (35,40]
# Group by the bin and calculate averages
age_data %>% group_by(YEARS_BINNED) %>% summarise_all(mean)
## # A tibble: 10 × 4
## YEARS_BINNED TARGET DAYS_BIRTH YEARS_BIRTH
## <fct> <dbl> <dbl> <dbl>
## 1 (20,25] 0.123 8533. 23.4
## 2 (25,30] 0.111 10155. 27.8
## 3 (30,35] 0.103 11855. 32.5
## 4 (35,40] 0.0894 13708. 37.6
## 5 (40,45] 0.0785 15498. 42.5
## 6 (45,50] 0.0742 17324. 47.5
## 7 (50,55] 0.0670 19196. 52.6
## 8 (55,60] 0.0553 20984. 57.5
## 9 (60,65] 0.0527 22781. 62.4
## 10 (65,70] 0.0373 24293. 66.6
age_data %>% group_by(YEARS_BINNED) %>% summarise_all(mean) %>%
ggplot(aes(x=YEARS_BINNED, y=TARGET*100))+
geom_col(fill = "blue",alpha=0.3)+
labs(title = "Failure to Repay by Age Group", x = "Age Group (years)", y= 'Failure to Repay (%)')+
  theme(plot.title = element_text(hjust = .5))
The visualization of the grouped age data highlights a clear trend: younger applicants are more likely to fail to repay. Failure-to-repay rates exceed 10% for the three youngest age brackets, compared with under 5% for the oldest age group.
This actionable insight carries direct implications for the bank’s decision-making processes. The information suggests that younger clients may benefit from additional support, guidance, or financial planning tips to enhance their repayment capabilities. Importantly, this recommendation is not advocating for discriminatory practices but rather underscores the prudence of implementing precautionary measures to assist younger clients in meeting their payment obligations more effectively. While age alone may not be a decisive factor, incorporating it into machine learning models can contribute to more nuanced credit risk assessments.
Exterior Sources Analysis
The analysis of exterior data sources reveals three variables, namely EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3, with the most notable negative correlations with the target variable. As per the documentation, these features represent normalized scores derived from external data sources, potentially constituting a cumulative credit rating based on diverse data inputs.
ext_data = train[,c('TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH')]
ext_data_corrs = cor(ext_data, use = "pairwise")
ext_data_corrs
## TARGET EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH
## TARGET 1.00000000 -0.1553171 -0.16047167 -0.1789187 -0.07823931
## EXT_SOURCE_1 -0.15531713 1.0000000 0.21398238 0.1868455 0.60060997
## EXT_SOURCE_2 -0.16047167 0.2139824 1.00000000 0.1091673 0.09199587
## EXT_SOURCE_3 -0.17891870 0.1868455 0.10916729 1.0000000 0.20547760
## DAYS_BIRTH -0.07823931 0.6006100 0.09199587 0.2054776 1.00000000
## Var1 Var2 value
## 1 TARGET TARGET 1.00000000
## 2 EXT_SOURCE_1 TARGET -0.15531713
## 3 EXT_SOURCE_2 TARGET -0.16047167
## 4 EXT_SOURCE_3 TARGET -0.17891870
## 5 DAYS_BIRTH TARGET -0.07823931
## 6 TARGET EXT_SOURCE_1 -0.15531713
The correlation matrix illustrates the relationships between these external sources and the target, as well as their intercorrelations. Notably, EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 all exhibit negative correlations with the target, suggesting that higher scores from these external sources are associated with a lower likelihood of loan default. Moreover, examining the correlations between the external sources themselves and with other variables like DAYS_BIRTH provides a comprehensive view of their interconnectedness.
These findings imply that these external sources, capturing additional information beyond the immediate loan application, play a crucial role in predicting the likelihood of timely loan repayment. In a practical sense, these external data features can serve as valuable inputs for machine learning models, contributing to more accurate assessments of creditworthiness and aiding in risk mitigation for the lending institution.
# melted_cormat is ext_data_corrs reshaped to long (Var1, Var2, value) form
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson\nCorrelation") +
labs(title='Correlation Heatmap')+
theme(plot.title = element_text(hjust = .5), axis.text.x = element_text(angle=90),
axis.title.x=element_blank(),
axis.title.y=element_blank())+
coord_fixed()+
  geom_text(aes(Var2, Var1, label = round(value, 2)), color = "black", size = 3)
The correlation heatmap visually encapsulates the relationships between the variables via Pearson correlation coefficients. The three EXT_SOURCE features correlate negatively with the target, implying that higher EXT_SOURCE values are associated with a greater likelihood of repayment. There is also a positive correlation between DAYS_BIRTH and EXT_SOURCE_1, suggesting that client age might feed into this scoring mechanism.
To delve deeper into the impact of these features on loan repayment, the distribution of each variable is explored, color-coded by the target value. This approach enables a more nuanced understanding of how each variable influences the likelihood of loan default. Such insights are crucial for refining credit risk assessment models and tailoring strategies to enhance the overall repayment rates for clients.
p1=ggplot(train, aes(EXT_SOURCE_1,colour=as.factor(TARGET),group=as.factor(TARGET)))+
geom_density()+
scale_x_continuous(breaks=seq(0,1,by=0.2))+
scale_y_continuous()+
theme_light() +
labs(title = "Distribution of EXT_SOURCE_1 by Target Value", x = "EXT_SOURCE_1")+
theme(plot.title = element_text(hjust = .5))
p2=ggplot(train, aes(EXT_SOURCE_2,colour=as.factor(TARGET),group=as.factor(TARGET)))+
geom_density()+
scale_x_continuous(breaks=seq(0,1,by=0.2))+
scale_y_continuous()+
theme_light() +
labs(title = "Distribution of EXT_SOURCE_2 by Target Value", x = "EXT_SOURCE_2")+
theme(plot.title = element_text(hjust = .5))
p3=ggplot(train, aes(EXT_SOURCE_3,colour=as.factor(TARGET),group=as.factor(TARGET)))+
geom_density()+
scale_x_continuous(breaks=seq(0,1,by=0.2))+
scale_y_continuous()+
theme_light() +
labs(title = "Distribution of EXT_SOURCE_3 by Target Value", x = "EXT_SOURCE_3")+
theme(plot.title = element_text(hjust = .5))
grid.arrange(p1, p2, p3, nrow = 3)
The distributions of EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3, stratified by target value, reveal interesting patterns. All three features show subtle differences between the target classes, with EXT_SOURCE_3 exhibiting the most pronounced distinction, suggesting it holds the strongest relationship with repayment likelihood. Although the correlations are weak, all three variables remain valuable inputs for a machine learning model, contributing to more precise credit risk evaluations.
2 Feature Engineering
In optimizing predictor variables for enhanced machine learning model performance, a systematic feature engineering process was employed. This involved centering and scaling to standardize predictor values, mitigating skewness by applying logarithmic, square root, or inverse transformations to numeric variables, and employing Box-Cox transformation to further enhance non-normal distributions. Lambda estimates were derived for each predictor, indicating the extent of transformation. Pre-processing steps incorporated Box-Cox transformations for specific columns, contributing to improved data normality. The impact of these transformations was visually assessed through histograms, providing a clear before-and-after view. These efforts aim to create a more robust set of predictor variables, potentially boosting the effectiveness of subsequent machine learning models.
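For reference, the one-parameter Box-Cox family whose lambda `BoxCoxTrans` estimates has a simple closed form. This is a base-R sketch of the transform itself; the lambda values come from the caret estimates shown below:

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the lambda = 0 limit.
# Requires strictly positive input.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# e.g. applying AMT_CREDIT's estimated lambda of 0.2 to two sample amounts
box_cox(c(135000, 406597.5), 0.2)
```

Lambda near 1 means the variable needs little transformation; lambda near 0 behaves like a log transform, pulling in heavy right tails such as AMT_INCOME_TOTAL's.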
#Skew Values
skewValues <- as.data.frame(apply(dt1_num, 2, function(x) skewness(x, na.rm = TRUE)))
colnames(skewValues)[1] <- "skew_values"
skewValues <- index_to_col(skewValues,'Column')
skewValues <- setDT(skewValues)[order (skew_values, decreasing = TRUE)]
top_15 <- head(skewValues, 15)
print(top_15)
## Column skew_values
## 1: FLAG_DOCUMENT_12 392.11095
## 2: AMT_INCOME_TOTAL 391.55583
## 3: FLAG_DOCUMENT_10 209.58701
## 4: FLAG_DOCUMENT_2 153.79032
## 5: AMT_REQ_CREDIT_BUREAU_QRT 134.36426
## 6: FLAG_DOCUMENT_4 110.89328
## 7: FLAG_DOCUMENT_7 72.17340
## 8: FLAG_DOCUMENT_17 61.21354
## 9: FLAG_DOCUMENT_21 54.61241
## 10: FLAG_DOCUMENT_20 44.36446
## 11: FLAG_DOCUMENT_19 40.95573
## 12: FLAG_DOCUMENT_15 28.69905
## 13: AMT_REQ_CREDIT_BUREAU_DAY 27.04320
## 14: FLAG_DOCUMENT_14 18.37235
## 15: FLAG_DOCUMENT_13 16.75358
BoxCoxValues <- apply(dt1_num, 2, function(x) BoxCoxTrans(x, na.rm = TRUE))
x = list()
for (i in 1:ncol(dt1_num)){
lambda <- BoxCoxValues[[i]][[1]]
x[[i]] <- lambda
}
lambda = do.call(rbind, x)
lambda_df <- as.data.frame(cbind(colnames(dt1_num),lambda))
colnames(lambda_df)[1] <- "Column"
colnames(lambda_df)[2] <- "lambda"
knitr::kable(setDT(lambda_df)[!is.na(lambda)])
| Column | lambda |
|---|---|
| SK_ID_CURR | 0.7 |
| AMT_INCOME_TOTAL | -0.0999999999999999 |
| AMT_CREDIT | 0.2 |
| AMT_ANNUITY | 0.2 |
| AMT_GOODS_PRICE | 0.2 |
| REGION_POPULATION_RELATIVE | 0.3 |
| CNT_FAM_MEMBERS | 0.2 |
| REGION_RATING_CLIENT | 1 |
| REGION_RATING_CLIENT_W_CITY | 1 |
| EXT_SOURCE_1 | 0.9 |
| EXT_SOURCE_2 | 1.5 |
| EXT_SOURCE_3 | 1.2 |
## Created from 11351 samples and 28 variables
##
## Pre-processing:
## - Box-Cox transformation (12)
## - ignored (16)
##
## Lambda estimates for Box-Cox transformation:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1000 0.2000 0.5000 0.6083 1.0000 1.5000
col_trans <- lambda_df[!is.na(lambda)]$Column
i = 5
x <- list(
title = as.character(col_trans[i])
)
p1 <- plot_ly(x = ~setDT(dt1)[,get(as.character(col_trans[i]))], type = "histogram", autobinx = FALSE) %>% layout(showlegend = FALSE)
p2 <- plot_ly(x = ~setDT(dt1_tran)[,get(as.character(col_trans[i]))], type = "histogram", autobinx = FALSE) %>% layout(showlegend = FALSE)
subplot(p1, p2)
Before Transformation
doPlots(as.data.frame(dt1)[, colnames(dt1) %in% as.character(col_trans)], plotHist, ii = 1:length(col_trans))
After Transformation
doPlots(as.data.frame(dt1_tran)[, colnames(dt1_tran) %in% as.character(col_trans)], plotHist, ii = 1:length(col_trans))
Removing Predictors
nzv <- nearZeroVar(dt1,saveMetrics= TRUE)
nzv <- index_to_col(nzv,"Column")
nzv_tb <- setDT(nzv)[nzv == TRUE | zeroVar ==TRUE]
head(nzv_tb, 15)
## Column freqRatio percentUnique zeroVar nzv
## 1: DAYS_EMPLOYED 354.96154 4.0889594193 FALSE TRUE
## 2: FLAG_MOBIL 307510.00000 0.0006503832 FALSE TRUE
## 3: FLAG_CONT_MOBILE 534.73345 0.0006503832 FALSE TRUE
## 4: REG_REGION_NOT_LIVE_REGION 65.03199 0.0006503832 FALSE TRUE
## 5: LIVE_REGION_NOT_WORK_REGION 23.59498 0.0006503832 FALSE TRUE
## 6: BASEMENTAREA_AVG 58.74502 1.2292243204 FALSE TRUE
## 7: LANDAREA_AVG 82.53968 1.1469508408 FALSE TRUE
## 8: NONLIVINGAREA_AVG 107.57326 1.0698804270 FALSE TRUE
## 9: BASEMENTAREA_MODE 61.70260 1.2490610092 FALSE TRUE
## 10: LANDAREA_MODE 83.90865 1.1586577391 FALSE TRUE
## 11: NONLIVINGAREA_MODE 140.72537 1.0819125170 FALSE TRUE
## 12: BASEMENTAREA_MEDI 55.31734 1.2266227875 FALSE TRUE
## 13: LANDAREA_MEDI 80.80711 1.1576821642 FALSE TRUE
## 14: NONLIVINGAREA_MEDI 113.08720 1.0806117505 FALSE TRUE
## 15: FLAG_DOCUMENT_2 23653.69231 0.0006503832 FALSE TRUE
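Dropping the flagged columns is a one-liner; this is a sketch assuming `dt1` is the working dataset, using the index form of `nearZeroVar()` rather than the metrics table above:

```r
# Hedged sketch: removing the near-zero-variance predictors flagged above.
# Without saveMetrics = TRUE, nearZeroVar() returns the offending column indices.
library(caret)

nzv_idx <- nearZeroVar(dt1)               # indices of nzv / zero-variance columns
dt1 <- as.data.frame(dt1)[, -nzv_idx]     # keep only the informative predictors
```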
df_corr = cor(dt1_num2, use = "pairwise.complete.obs")
hc = findCorrelation(df_corr, cutoff=0.80)
hc = sort(hc)
dt1_num3 = as.data.frame(dt1_num2)[,-c(hc)]
rm_col_hc <- setdiff(colnames(dt1_num2),colnames(dt1_num3))
rm_col_hc
## [1] "AMT_ANNUITY" "AMT_GOODS_PRICE"
## [3] "FLAG_EMP_PHONE" "REGION_RATING_CLIENT_W_CITY"
## [5] "REG_REGION_NOT_WORK_REGION" "REG_CITY_NOT_WORK_CITY"
## [7] "APARTMENTS_AVG" "BASEMENTAREA_AVG"
## [9] "YEARS_BEGINEXPLUATATION_AVG" "YEARS_BUILD_AVG"
## [11] "COMMONAREA_AVG" "ELEVATORS_AVG"
## [13] "ENTRANCES_AVG" "FLOORSMAX_AVG"
## [15] "FLOORSMIN_AVG" "LANDAREA_AVG"
## [17] "LIVINGAPARTMENTS_AVG" "LIVINGAREA_AVG"
## [19] "NONLIVINGAPARTMENTS_AVG" "NONLIVINGAREA_AVG"
## [21] "APARTMENTS_MODE" "YEARS_BUILD_MODE"
## [23] "LIVINGAREA_MODE" "APARTMENTS_MEDI"
## [25] "BASEMENTAREA_MEDI" "YEARS_BEGINEXPLUATATION_MEDI"
## [27] "COMMONAREA_MEDI" "ELEVATORS_MEDI"
## [29] "ENTRANCES_MEDI" "FLOORSMAX_MEDI"
## [31] "FLOORSMIN_MEDI" "LANDAREA_MEDI"
## [33] "LIVINGAPARTMENTS_MEDI" "LIVINGAREA_MEDI"
## [35] "NONLIVINGAPARTMENTS_MEDI" "NONLIVINGAREA_MEDI"
## [37] "TOTALAREA_MODE" "OBS_30_CNT_SOCIAL_CIRCLE"
## [39] "DEF_30_CNT_SOCIAL_CIRCLE"
We removed all columns identified as highly correlated or near-zero-variance, added the engineered predictors, and converted the categorical variables to dummy variables.
dt1_preproc <- cbind(dt1_non_num_dum,dt1_num)
mv <- as.data.frame(apply(dt1_preproc, 2, function(col)sum(is.na(col))/length(col)))
colnames(mv)[1] <- "missing_values"
mv <- index_to_col(mv,'Column')
mv <- setDT(mv)[order (missing_values, decreasing = TRUE)]
ggplot(mv[1:40,], aes(reorder(Column, missing_values), missing_values)) + geom_bar(position = position_dodge(), stat = "identity") + coord_flip() + xlab('Columns') + ylab('Missing Value %')
We then used Recursive Feature Elimination to select variables and reduce computational time going forward.
Recursive Feature Elimination
control <- rfeControl(functions=rfFuncs, method="cv", number=3)
trainctrl <- trainControl(classProbs= TRUE, summaryFunction = twoClassSummary)
results <- rfe(as.data.frame(dt1_preproc_sample)[,-c(153)],as.data.frame(dt1_preproc_sample)[,c(153)], sizes=c(1:100), rfeControl=control, method="rf",metric = "AUC", trControl = trainctrl)
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (3 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 4064 0.1408 3357 35.36 0.015490 35.47
## 2 3743 0.2572 3019 30.74 0.016769 45.54
## 3 3479 0.3542 2830 46.82 0.025566 71.31
## 4 3231 0.4542 2620 11.41 0.014291 48.68
## 5 2898 0.5755 2340 72.13 0.015319 62.65
## 6 2807 0.5807 2184 87.82 0.020661 99.44
## 7 2818 0.5783 2212 75.45 0.017376 64.63
## 8 2812 0.5804 2212 74.20 0.016274 65.11
## 9 2735 0.6005 2095 71.01 0.016058 74.68
## 10 2715 0.6067 2085 56.62 0.011686 56.60
## 11 2687 0.6151 2067 73.00 0.017082 67.89
## 12 2693 0.6124 2036 65.80 0.016328 73.92
## 13 2696 0.6116 2045 60.57 0.014360 67.76
## 14 2682 0.6159 2033 47.49 0.009743 48.93
## 15 2690 0.6132 2024 46.25 0.011474 42.49
## 16 2678 0.6167 2021 45.06 0.010037 40.99
## 17 2671 0.6188 2026 53.48 0.012171 51.23
## 18 2669 0.6196 2021 61.07 0.013225 54.66
## 19 2661 0.6218 2020 62.67 0.014114 56.53
## 20 2657 0.6235 2020 54.97 0.011844 55.71
## 21 2650 0.6254 2013 61.02 0.013371 52.39
## 22 2641 0.6282 2012 49.55 0.009886 43.01
## 23 2634 0.6307 2011 50.53 0.009198 42.17
## 24 2634 0.6304 2008 56.10 0.010650 50.61
## 25 2631 0.6315 2010 54.94 0.010583 50.70
## 26 2631 0.6316 2010 51.95 0.010157 47.85
## 27 2633 0.6309 2009 55.38 0.011148 53.75
## 28 2628 0.6326 2006 64.23 0.013452 59.92
## 29 2630 0.6321 2011 57.55 0.011792 57.82
## 30 2624 0.6336 2000 57.92 0.012394 58.29
## 31 2616 0.6361 2000 61.99 0.012320 60.49
## 32 2628 0.6327 2006 63.51 0.013601 63.74
## 33 2628 0.6325 2009 68.97 0.015640 67.15
## 34 2625 0.6338 2006 59.60 0.012973 62.75
## 35 2621 0.6349 2008 69.60 0.014960 67.46
## 36 2619 0.6352 2001 65.37 0.014276 61.51
## 37 2620 0.6352 2003 69.09 0.015192 66.94
## 38 2615 0.6367 1996 58.95 0.012172 63.43
## 39 2617 0.6362 2000 60.28 0.012249 59.08
## 40 2614 0.6370 1998 52.74 0.010478 52.14
## 41 2611 0.6380 1998 61.35 0.012543 55.99
## 42 2611 0.6379 1996 65.21 0.013467 56.19 *
## 43 2617 0.6362 2001 63.15 0.012902 56.78
## 44 2613 0.6375 1998 63.06 0.012878 56.06
## 45 2618 0.6358 2000 66.94 0.014301 58.91
## 46 2611 0.6379 1997 61.21 0.012222 51.02
## 47 2619 0.6356 2005 70.34 0.014979 60.60
## 48 2621 0.6350 2003 71.22 0.014996 61.89
## 49 2617 0.6362 2002 66.43 0.013624 55.39
## 50 2620 0.6353 2005 68.16 0.014362 55.60
## 51 2616 0.6363 1999 66.58 0.013774 57.71
## 52 2618 0.6359 2000 71.13 0.015636 57.32
## 53 2617 0.6363 2000 72.91 0.015630 61.88
## 54 2618 0.6358 2000 71.87 0.015619 64.28
## 55 2621 0.6352 2006 72.04 0.015820 63.18
## 56 2616 0.6369 2003 65.12 0.013557 58.01
## 57 2619 0.6357 2002 63.71 0.013607 55.07
## 58 2623 0.6345 2009 63.88 0.013597 53.42
## 59 2625 0.6338 2007 60.41 0.012600 48.83
## 60 2626 0.6335 2008 65.62 0.013935 56.65
## 61 2620 0.6354 2004 60.88 0.012429 49.38
## 62 2621 0.6351 2007 67.15 0.014499 53.20
## 63 2623 0.6345 2007 62.79 0.013840 50.32
## 64 2622 0.6348 2007 66.28 0.014270 52.00
## 65 2619 0.6356 2007 68.48 0.014806 56.23
## 66 2618 0.6362 2005 69.88 0.015706 56.63
## 67 2616 0.6368 2003 65.83 0.014198 55.90
## 68 2618 0.6362 2005 62.17 0.013090 51.95
## 69 2619 0.6357 2006 63.15 0.013844 56.82
## 70 2619 0.6357 2007 65.46 0.014815 52.99
## 71 2614 0.6372 2003 59.53 0.012807 52.19
## 72 2612 0.6378 2000 58.60 0.013109 51.49
## 73 2614 0.6373 2003 60.26 0.013081 56.32
## 74 2620 0.6356 2006 66.28 0.014651 58.37
## 75 2615 0.6369 2000 66.40 0.015123 58.54
## 76 2614 0.6372 2002 76.75 0.017644 65.08
## 77 2613 0.6378 2002 64.45 0.014467 57.25
## 78 2621 0.6352 2005 73.00 0.016605 60.59
## 79 2620 0.6356 2007 62.19 0.013770 54.50
## 80 2623 0.6348 2007 62.16 0.013861 53.21
## 81 2615 0.6372 2003 59.47 0.013216 49.43
## 82 2618 0.6363 2008 60.58 0.013463 54.53
## 83 2618 0.6363 2008 57.41 0.012670 52.62
## 84 2621 0.6353 2007 64.15 0.014311 56.72
## 85 2623 0.6347 2008 63.85 0.014651 56.11
## 86 2616 0.6370 2003 60.25 0.012473 53.03
## 87 2621 0.6352 2007 55.66 0.012055 49.01
## 88 2617 0.6366 2007 59.98 0.012853 51.22
## 89 2621 0.6357 2008 58.24 0.011931 51.65
## 90 2622 0.6351 2010 58.98 0.012051 48.56
## 91 2617 0.6368 2006 58.76 0.012735 51.74
## 92 2618 0.6365 2007 58.49 0.012773 49.33
## 93 2619 0.6360 2006 57.09 0.011955 47.58
## 94 2616 0.6369 2006 56.37 0.012014 50.23
## 95 2617 0.6367 2006 65.79 0.014677 55.18
## 96 2617 0.6367 2008 61.93 0.013284 54.38
## 97 2618 0.6364 2007 62.62 0.013128 52.62
## 98 2614 0.6377 2006 57.79 0.011503 48.28
## 99 2616 0.6372 2005 61.26 0.013096 48.19
## 100 2616 0.6370 2005 63.96 0.013596 50.85
## 184 2617 0.6369 2008 58.15 0.011469 48.10
##
## The top 5 variables (out of 42):
## EXT_SOURCE_1, DAYS_REGISTRATION, DAYS_ID_PUBLISH, CNT_CHILDREN, ORGANIZATION_TYPEXNA
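Rather than transcribing the winning subset by hand, the selected names can be pulled directly off the `rfe` object; a sketch assuming `results` is the fit printed above:

```r
# Hedged sketch: caret's predictors() generic returns the variable names in the
# optimal subset chosen by rfe (42 variables in the run above).
library(caret)

selected_vars <- predictors(results)  # character vector of selected feature names
head(selected_vars, 5)
```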
cols_to_keep <- c('FLAG_OWN_CARN','`ORGANIZATION_TYPEIndustry: type 1`','DAYS_ID_PUBLISH',
                  'SK_ID_CURR','REG_CITY_NOT_LIVE_CITY','YEARS_BEGINEXPLUATATION_MODE',
                  'COMMONAREA_MODE','FLOORSMAX_MODE','LIVINGAPARTMENTS_MODE',
                  'YEARS_BUILD_MEDI','CODE_GENDERM','OCCUPATION_TYPEWaiters/barmen staff',
                  'TARGET','EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','CNT_CHILDREN')
dt1_preproc_sample <- as.data.frame(dt1_preproc_sample)[, (colnames(dt1_preproc_sample) %in% cols_to_keep)]
dt1_preproc <- as.data.frame(dt1_preproc)[, (colnames(dt1_preproc) %in% cols_to_keep)]
cols_to_keep
## [1] "FLAG_OWN_CARN" "`ORGANIZATION_TYPEIndustry: type 1`"
## [3] "DAYS_ID_PUBLISH" "SK_ID_CURR"
## [5] "REG_CITY_NOT_LIVE_CITY" "YEARS_BEGINEXPLUATATION_MODE"
## [7] "COMMONAREA_MODE" "FLOORSMAX_MODE"
## [9] "LIVINGAPARTMENTS_MODE" "YEARS_BUILD_MEDI"
## [11] "CODE_GENDERM" "OCCUPATION_TYPEWaiters/barmen staff"
## [13] "TARGET" "EXT_SOURCE_1"
## [15] "EXT_SOURCE_2" "EXT_SOURCE_3"
## [17] "CNT_CHILDREN"
These are the final features we selected for our model. The raw TARGET variable takes the values 1 and 0, which we recode to factor levels below.
dt1_preproc$TARGET <- ifelse(dt1_preproc$TARGET == 0, 'Yes', 'No')
dt1_preproc$TARGET <- as.factor(dt1_preproc$TARGET)
# Create training and testing data partitions
inTrain <- createDataPartition(dt1_preproc$TARGET, p = 0.8, list = FALSE)
dtTrain <- dt1_preproc[inTrain, ]
dtTest <- dt1_preproc[-inTrain, ]
3 Model Building
traincntrl <- trainControl(method = 'repeatedcv',
number = 5,
repeats = 2,
classProbs = TRUE,
sampling = "down",
summaryFunction = twoClassSummary)
Support Vector Machine (SVM Radial)
svmFit <- train(TARGET ~.,
data = subset_train_data,
method = 'svmRadial',
preProc = c('center','scale'),
tuneLength = 3,
trControl = traincntrl)
svmFit
## Support Vector Machines with Radial Basis Function Kernel
##
## 24601 samples
## 15 predictor
## 2 classes: 'No', 'Yes'
##
## Pre-processing: centered (15), scaled (15)
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 19680, 19681, 19681, 19681, 19681, 19680, ...
## Addtional sampling using down-sampling prior to pre-processing
##
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 0.7300545 0.6795065 0.6716560
## 0.50 0.7269650 0.6706891 0.6687155
## 1.00 0.7216793 0.6588592 0.6741764
##
## Tuning parameter 'sigma' was held constant at a value of 0.06342473
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.06342473 and C = 0.25.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3212 18543
## Yes 1753 37994
##
## Accuracy : 0.67
## 95% CI : (0.6663, 0.6737)
## No Information Rate : 0.9193
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1255
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.64693
## Specificity : 0.67202
## Pos Pred Value : 0.14764
## Neg Pred Value : 0.95590
## Prevalence : 0.08073
## Detection Rate : 0.05223
## Detection Prevalence : 0.35373
## Balanced Accuracy : 0.65947
##
## 'Positive' Class : No
##
We trained the Support Vector Machine (SVM) with a radial basis function kernel on 24,601 samples with 15 predictors, after centering and scaling. The cost parameter (C) and kernel width (sigma) were tuned through cross-validated resampling; the selected model (C = 0.25, sigma = 0.0634) achieved a cross-validated ROC AUC of 0.73. On the (notably imbalanced) test set, the confusion matrix shows 67% accuracy, with sensitivity of 64.69% and specificity of 67.20%, reflecting a trade-off between correctly identifying positive cases and limiting false positives. The prevalence and detection rates round out the evaluation of its classification performance.
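The chunk that generated the SVM confusion matrix above is not shown; it was presumably along these lines, with `dtTest` as the held-out partition created earlier:

```r
# Hedged sketch of the missing prediction step. svmFit was trained with
# preProc = c('center','scale'), which predict() re-applies automatically.
library(caret)

svm_pred <- predict(svmFit, newdata = dtTest)   # hard class predictions
confusionMatrix(svm_pred, dtTest$TARGET)        # the statistics printed above
```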
KNN
The k-Nearest Neighbors (KNN) algorithm was employed for binary classification, assuming that similar instances exist in close proximity.
knnFit <- train(TARGET ~.,
data = subset_train_data,
method = "knn",
preProc = c("center", "scale"),
metric = "Accuracy",
tuneGrid = data.frame(.k = 1:20),
trControl = traincntrl)
knnFit$results
## k ROC Sens Spec ROCSD SensSD SpecSD
## 1 1 0.5787709 0.5848468 0.5726951 0.018295566 0.03241870 0.007727375
## 2 2 0.6068607 0.5772831 0.5716339 0.009981657 0.02150961 0.005909824
## 3 3 0.6345142 0.6223631 0.5945169 0.013106437 0.02059142 0.010046121
## 4 4 0.6414925 0.6180924 0.5919965 0.014459362 0.02477989 0.007446866
## 5 5 0.6559403 0.6289021 0.6113862 0.010888943 0.02321467 0.014508864
## 6 6 0.6567551 0.6276395 0.6018351 0.012154432 0.01276619 0.014003844
## 7 7 0.6679057 0.6442629 0.6126907 0.012070177 0.01641820 0.011643629
## 8 8 0.6726300 0.6437553 0.6139288 0.011869174 0.02287752 0.007904402
## 9 9 0.6782973 0.6457736 0.6214681 0.013764469 0.02351186 0.011484245
## 10 10 0.6827816 0.6538220 0.6183728 0.012353530 0.01690464 0.010838505
## 11 11 0.6825743 0.6497956 0.6307097 0.013536951 0.02605246 0.010819747
## 12 12 0.6902388 0.6633849 0.6289410 0.017913308 0.02951769 0.015877317
## 13 13 0.6888980 0.6470299 0.6343577 0.015250052 0.02320823 0.014425426
## 14 14 0.6913659 0.6598572 0.6283440 0.015418786 0.03006231 0.010133325
## 15 15 0.6914607 0.6636324 0.6313730 0.017881663 0.02928959 0.016104205
## 16 16 0.6954432 0.6616299 0.6269069 0.016671694 0.02096529 0.015075127
## 17 17 0.6996474 0.6769768 0.6304002 0.015738800 0.02403170 0.011118322
## 18 18 0.6946415 0.6543258 0.6377183 0.012996049 0.02287255 0.020672140
## 19 19 0.6982778 0.6560814 0.6403051 0.013811924 0.02607474 0.013130324
## 20 20 0.6983789 0.6656646 0.6372098 0.011480325 0.02545053 0.013973171
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3176 20842
## Yes 1789 35695
##
## Accuracy : 0.632
## 95% CI : (0.6282, 0.6358)
## No Information Rate : 0.9193
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0985
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.63968
## Specificity : 0.63136
## Pos Pred Value : 0.13223
## Neg Pred Value : 0.95227
## Prevalence : 0.08073
## Detection Rate : 0.05164
## Detection Prevalence : 0.39052
## Balanced Accuracy : 0.63552
##
## 'Positive' Class : No
##
The model was trained on a subset of the data with preprocessing steps involving centering and scaling. A range of k values (1 to 20) was explored; because twoClassSummary reports ROC rather than accuracy during resampling, the best-performing setting was k = 17, with the highest cross-validated ROC (0.700). On the test set the model achieved 63.2% accuracy, and the confusion matrix revealed a balanced sensitivity (63.97%) and specificity (63.14%), for a balanced accuracy of 63.55%, indicating a moderate ability to correctly identify both positive and negative cases.
Logistic Regression
logisticReg <- train(TARGET ~.,
data = dtTrain,
method = 'glm',
metric="Accuracy",
trControl = traincntrl)
predictions <- predict(logisticReg, newdata = dtTest, type = "raw")
confusionMatrix(predictions, dtTest$TARGET)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 3301 18750
## Yes 1664 37787
##
## Accuracy : 0.6681
## 95% CI : (0.6643, 0.6718)
## No Information Rate : 0.9193
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1297
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.66485
## Specificity : 0.66836
## Pos Pred Value : 0.14970
## Neg Pred Value : 0.95782
## Prevalence : 0.08073
## Detection Rate : 0.05367
## Detection Prevalence : 0.35854
## Balanced Accuracy : 0.66661
##
## 'Positive' Class : No
##
roc_curve <- roc(dtTest$TARGET, as.numeric(predictions == "Yes")) # "Yes" is the positive class here
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
Logistic Regression was employed for binary classification, offering a balance between speed and accuracy compared to other techniques. The logistic regression model achieved an accuracy of 66.81% on the test dataset. The confusion matrix revealed balanced sensitivity and specificity, indicating a moderate ability to correctly classify both positive and negative instances. The sensitivity was 66.49%, specificity was 66.84%, and the balanced accuracy reached 66.66%. The model’s performance is characterized by a Kappa value of 0.1297, suggesting a fair agreement beyond chance. Additionally, the Receiver Operating Characteristic (ROC) curve analysis further assesses the model’s ability to discriminate between classes.
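One caveat: the ROC curve above is built from hard class labels, which yields only a single operating point. Using the predicted class probabilities gives the full curve; a sketch assuming `logisticReg` and `dtTest` from above:

```r
# Hedged sketch: ROC from predicted probabilities rather than hard labels.
library(pROC)

prob <- predict(logisticReg, newdata = dtTest, type = "prob")  # columns No / Yes
roc_curve <- roc(response = dtTest$TARGET, predictor = prob$Yes,
                 levels = c("No", "Yes"), direction = "<")      # control, then case
auc(roc_curve)  # area under the full curve
```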
4 Model Evaluation and Comparison
Support Vector Machine (SVM): Using a radial basis function kernel (sigma = 0.06342473, C = 0.25), the SVM achieved 67% accuracy on the test set, with balanced sensitivity (64.69%) and specificity (67.20%), making it a strong candidate for anticipating and streamlining loan approval outcomes.
k-Nearest Neighbors (KNN): Tuned over 20 values of k, KNN reached 63.2% test accuracy, with balanced sensitivity (63.97%) and specificity (63.14%), offering an alternative approach for discerning patterns in approval outcomes.
Logistic Regression: Known for its computational efficiency, logistic regression achieved 66.81% test accuracy with balanced sensitivity (66.49%) and specificity (66.84%), striking a balance between speed and accuracy relative to the more complex models.
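The side-by-side comparison above can also be made formal with caret's resampling utilities; a sketch assuming the three `train` objects from above share a comparable resampling scheme:

```r
# Hedged sketch: comparing the fitted models across their resampling results.
# svmFit, knnFit, and logisticReg are the caret::train objects from above.
library(caret)

resamp <- resamples(list(SVM = svmFit, KNN = knnFit, GLM = logisticReg))
summary(resamp)                  # ROC, Sens, Spec distributions per model
bwplot(resamp, metric = "ROC")   # box-and-whisker comparison of ROC
```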
5 Conclusion and Recommendations
Each model presents strengths and trade-offs in accuracy and computational efficiency.
SVM, with its 67% accuracy, stands out for robust predictive modeling, though computational intensity should be considered.
KNN, while less accurate, offers an alternative perspective, potentially uncovering unique patterns.
Logistic Regression, with its 66.81% accuracy, is an attractive option due to its balance between speed and accuracy.
Further refinement, including ensemble approaches, could enhance predictive power. We recommend continued exploration, weighing each model's accuracy against its computational cost, to arrive at a balanced predictive model that minimizes the risks associated with loan approvals while improving decision-making for lending institutions. This aligns with our overarching goal of transforming financial practices and fostering a more secure, informed lending experience.
Our journey extends beyond data; it’s about transforming the landscape of financial decision-making, fostering an enriched lending experience for all stakeholders.